TensorFlow Doing HPC
TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow was initially designed for developing Machine Learning (ML) applications, it aims to support a much broader range of applications beyond the ML domain, possibly including HPC applications. However, very few experiments have been conducted to evaluate TensorFlow performance when running HPC workloads on supercomputers. This work addresses this gap by implementing four traditional HPC benchmark applications: STREAM, matrix-matrix multiply, a Conjugate Gradient (CG) solver, and Fast Fourier Transform (FFT). We analyze their performance on two supercomputers with accelerators and evaluate TensorFlow's potential for developing HPC applications. Our tests show that TensorFlow can take full advantage of high-performance networks and accelerators on supercomputers. Running our TensorFlow STREAM benchmark, we obtain over 50% of the theoretical communication bandwidth on our testing platform. We find approximately 2x, 1.7x, and 1.8x performance improvements when increasing the number of GPUs from two to four in the matrix-matrix multiply, CG, and FFT applications, respectively. All our performance results demonstrate that TensorFlow has high potential to emerge as an HPC programming framework for heterogeneous supercomputers.
Comment: Accepted for publication at the Ninth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES'19)
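The abstract does not include code, but a minimal single-GPU version of one of these benchmarks gives a feel for the programming model. The sketch below times a matrix-matrix multiply in TensorFlow; the matrix size, repetition count, and device string are illustrative choices, not parameters from the paper.

    # Minimal sketch (not the paper's benchmark code) of timing a
    # matrix-matrix multiply in TensorFlow on one GPU.
    import time
    import tensorflow as tf

    N, reps = 8192, 10  # illustrative problem size and repetition count

    with tf.device("/GPU:0"):  # assumes a GPU is visible to TensorFlow
        a = tf.random.uniform((N, N))
        b = tf.random.uniform((N, N))

    @tf.function
    def step(x, y):
        return tf.linalg.matmul(x, y)

    step(a, b)  # warm-up: keep tracing and kernel setup out of the timing

    start = time.perf_counter()
    for _ in range(reps):
        c = step(a, b)
    _ = c.numpy()  # block until the queued GPU work has finished
    elapsed = time.perf_counter() - start

    # A dense N x N multiply performs 2*N^3 floating-point operations.
    print(f"{2 * N**3 * reps / elapsed / 1e9:.1f} GFLOP/s")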
Characterizing Deep-Learning I/O Workloads in TensorFlow
The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during training, a large number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerators for computation. In addition, checkpoint and restart operations are carried out so that DL computing frameworks can restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by up to 2.3x and 7.8x on our two benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator with the input pipeline on the CPU, eliminating the effective cost of I/O on overall performance. Using a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage yields a 2.6x performance improvement over checkpointing directly to the slower storage on our benchmark environment.
Comment: Accepted for publication at PDSW-DISCS 2018
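As a rough illustration of the two mechanisms the abstract discusses, the sketch below shows the standard tf.data prefetcher and a burst-buffer-style checkpoint helper; the file pattern, batch size, storage paths, and helper name are our own placeholders, not details from the paper.

    import shutil
    import threading
    import tensorflow as tf

    # 1) Input pipeline with the TensorFlow prefetcher: reads and CPU
    #    pre-processing overlap with computation on the accelerator.
    dataset = (
        tf.data.TFRecordDataset(tf.io.gfile.glob("train-*.tfrecord"))
        .map(lambda rec: tf.io.decode_raw(rec, tf.uint8),
             num_parallel_calls=tf.data.AUTOTUNE)
        .batch(64)
        .prefetch(tf.data.AUTOTUNE)
    )

    # 2) Burst-buffer-style checkpointing: write synchronously to a fast,
    #    small-capacity tier, then drain to slow, large-capacity storage
    #    in a background thread.
    def checkpoint_with_burst_buffer(ckpt: tf.train.Checkpoint,
                                     fast_dir: str, slow_dir: str) -> str:
        path = ckpt.save(file_prefix=fast_dir + "/ckpt")  # fast write
        threading.Thread(
            target=shutil.copytree,
            args=(fast_dir, slow_dir),
            kwargs={"dirs_exist_ok": True},
            daemon=True,
        ).start()
        return path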
tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads
Machine Learning applications on HPC systems have been gaining popularity in recent years. Upcoming large-scale systems will offer tremendous parallelism for training through GPUs. However, another demanding aspect of Machine Learning is I/O, and this can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and allows instrumentation of TensorFlow operations. However, the current profiler only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend the TensorFlow Profiler and introduce tf-Darshan, both a profiler and a tracer, that performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We extract Darshan profiling data structures during TensorFlow execution to enable analysis through the TensorFlow Profiler, and we visualize the performance results through TensorBoard, the web-based TensorFlow visualization tool. At the same time, we do not alter Darshan's existing implementation. We illustrate tf-Darshan with two case studies on ImageNet image classification and malware classification. We show that, by guiding optimization with data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by selecting data for staging on fast-tier storage. We also show that Darshan has the potential to be used as a runtime library for profiling and for providing information for future optimization.
Comment: Accepted for publication at the 2020 International Conference on Cluster Computing (CLUSTER 2020)
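The abstract does not show the tf-Darshan attachment itself; for orientation, the sketch below demonstrates only the stock TensorFlow Profiler interface that tf-Darshan extends. The log directory is an illustrative choice.

    import tensorflow as tf

    tf.profiler.experimental.start("logs/profile")  # begin collecting a trace
    # ... run the training or input-pipeline steps to be profiled ...
    tf.profiler.experimental.stop()                 # write the trace to disk

    # Inspect the result in TensorBoard (Profile tab):
    #   tensorboard --logdir logs/profile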
sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems
Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation to exploit such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle-decomposition data layout, in contrast to the domain decomposition of CPU-based implementations, to use particle batches for overlapping communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speedups on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide a performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement with respect to the performance of the sputniPIC CPU OpenMP version. We show that reduced precision can further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, sputniPIC enables, on a single node with multiple GPUs, large-scale three-dimensional PIC simulations that were previously only possible on clusters.
Comment: Accepted for publication at the 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2020)
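sputniPIC itself is a CUDA/C++ code; as a language-level illustration of the particle-batch overlap the abstract describes, the Python/CuPy sketch below alternates two streams so that the host-to-device copy of one batch overlaps the mover kernel of the previous batch. The batch size and the placeholder "push" routine are our own choices.

    import cupy as cp
    import numpy as np

    n_particles, batch = 1_000_000, 250_000
    # Six floats per particle: position (x, y, z) and velocity (u, v, w).
    host = np.random.rand(n_particles, 6).astype(np.float32)

    streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]
    buffers = [cp.empty((batch, 6), dtype=cp.float32) for _ in range(2)]

    def push(p):
        p *= 1.0  # placeholder for the implicit particle mover

    for i, start in enumerate(range(0, n_particles, batch)):
        buf, stream = buffers[i % 2], streams[i % 2]
        buf.set(host[start:start + batch], stream=stream)  # H2D copy
        with stream:
            push(buf)  # queued behind the copy on the same stream
    cp.cuda.Device().synchronize()
    # Note: pinned host memory is required for the copies to be truly
    # asynchronous; with pageable memory they may serialize.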
Medicine Meets Virtual Reality 21
Editors: James D. Westwood, Susan W. Westwood, Li Felländer-Tsai, Cali M. Fidopiastis, Randy S. Haluck, Richard A. Robb, Steven Senger, Kirby G. Vosburgh.
The chapter "Varying the Speed of Perceived Self-Motion Affects Postural Control During Locomotion" is co-authored by Joshua Pickhinke, Jung Hung Chien, and Mukul Mukherjee, UNO faculty and staff members.
Virtual reality environments have been used to show the importance of the perception of self-motion in controlling posture and gait. In this study, the authors used a virtual reality environment to investigate whether varying optical flow speed had any effect on postural control during locomotion. Healthy young adult participants walked under two conditions: with optical flow matching their preferred walking speed, and with an optic flow speed that varied randomly around their preferred walking speed. Exposure to the varying optic flow increased the variability of their postural control, as measured by the area of the center of pressure (COP), compared with the matched-speed condition. If the perception of self-motion becomes less predictable, postural control during locomotion becomes more variable and possibly riskier.
https://digitalcommons.unomaha.edu/facultybooks/1261/thumbnail.jp